Skip to content

fix(pegboard): persist and replay hibernating requests#4656

Draft
NathanFlurry wants to merge 1 commit intobreak-up/keep-error-exposure-consistentfrom
break-up/persist-hibernating-requests
Draft

fix(pegboard): persist and replay hibernating requests#4656
NathanFlurry wants to merge 1 commit intobreak-up/keep-error-exposure-consistentfrom
break-up/persist-hibernating-requests

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 14, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude bot commented Apr 14, 2026

PR Review: fix(pegboard): persist and replay hibernating requests

Note: This is a draft PR. The description/test plan are not filled in.

Summary

This PR fixes a bug where hibernating WebSocket request IDs were not being persisted and replayed when an actor restarts. The old approach relied on a gateway guard intercepting CommandStartActor and populating hibernating_requests before it reached the runner, but this was not working correctly. The fix moves the population logic into insert_and_send_commands activities and the keepalive task so requests are correctly carried across restarts.


Issues

Correctness

1. Duplicate UDB fetch in actor2/runtime.rs (double population)

In actor2/runtime.rs, send_outbound fetches hibernating requests and sets them on the CommandStartActor before calling InsertAndSendCommands::run. Then insert_and_send_commands also fetches and overwrites hibernating_requests for every CommandStartActor. The second fetch overwrites the first, making the fetch in send_outbound redundant. This adds unnecessary UDB load. One of the two sites should be removed.

2. mk1 runner path is not covered

actor/runtime.rs has mk1 (non-mk2) runner code paths (e.g. the ctx.signal(crate::workflows::runner::Command { ... }) calls) that still send hibernating_requests: Vec::new() and do not go through insert_and_send_commands. If mk1 runners support hibernation, this is a bug regression. If they don't (or are deprecated for this use case), a comment explaining why would help.

3. Non-deterministic read inside a Gasoline activity

insert_and_send_commands now performs a live UDB read (hibernating_request::list) inside the activity before writing. Gasoline activities can be retried, and the read result could differ between attempts (e.g., a keepalive expired between retries). This could cause non-deterministic replay behavior. Worth evaluating whether this read should be moved outside the activity boundary or made part of the activity input instead.


Code Quality

4. Stale/misleading comments

Multiple call sites in actor/runtime.rs and actor2/runtime.rs that construct CommandStartActor with hibernating_requests: Vec::new() still carry the old comment:

// Empty because request ids are ephemeral. This is intercepted by guard and populated before it reaches the runner

Since the interception now happens inside insert_and_send_commands (not the guard), these comments are misleading and should be updated to reflect the new behavior.

5. tracing::info! in high-frequency hot paths

The new log lines in upsert.rs, delete.rs, and list.rs use tracing::info!. The keepalive loop calls upsert on every ping interval tick, so on systems with many hibernating actors this will produce high-volume info-level output. tracing::debug! is more appropriate for the per-tick upsert and list operations (consistent with the existing tracing::debug! calls in the keepalive loop body). tracing::info! is fine for delete since that's less frequent.


Minor

6. Unrelated change in envoy-client/connection.rs

The addition of "rivet_target.envoy" as a WebSocket subprotocol appears unrelated to the hibernating request persistence fix. It should either be explained in the PR description or split into a separate commit.

7. No tests added

The test checklist item is unchecked. Given this is a distributed systems bug fix involving actor restart and request replay, integration tests would add significant confidence. Even a basic test that hibernates a WebSocket connection, restarts the actor, and verifies the request is replayed would be valuable.


Suggestions

  • Remove the redundant hibernating_requests fetch from send_outbound in actor2/runtime.rs (issue 1).
  • Update or remove the stale // intercepted by guard comments at all CommandStartActor construction sites (issue 4).
  • Downgrade the new tracing::info! calls in upsert.rs and list.rs to tracing::debug! (issue 5).
  • Add a comment on the mk1 paths explaining whether they need this fix or why they don't (issue 2).
  • Consider the Gasoline replay semantics for the UDB read inside the activity (issue 3).

@NathanFlurry NathanFlurry force-pushed the break-up/serialize-gateway-actor-keys branch from 9550912 to 3948546 Compare April 15, 2026 02:40
@NathanFlurry NathanFlurry force-pushed the break-up/persist-hibernating-requests branch from 9191b42 to 88f4613 Compare April 15, 2026 02:40
@NathanFlurry NathanFlurry changed the base branch from break-up/serialize-gateway-actor-keys to graphite-base/4656 April 15, 2026 02:50
@NathanFlurry NathanFlurry force-pushed the break-up/persist-hibernating-requests branch from 88f4613 to ae4fe54 Compare April 15, 2026 02:50
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4656 to break-up/keep-error-exposure-consistent April 15, 2026 02:51
@NathanFlurry NathanFlurry force-pushed the break-up/keep-error-exposure-consistent branch from ae2a18f to feb2f40 Compare April 15, 2026 06:55
@NathanFlurry NathanFlurry force-pushed the break-up/persist-hibernating-requests branch from ae4fe54 to c862fec Compare April 15, 2026 06:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant